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A LAVA ATTACK ON THE RECOVERY OF SUMS OF DENSE AND 

SPARSE SIGNALS 

VICTOR CHERNOZHUKOV, CHRISTIAN HANSEN, AND YUAN LIAO 


Abstract. Common high-dimensional methods for prediction rely on having either 
a sparse signal model, a model in which most parameters are zero and there are a 
small number of non-zero parameters that are large in magnitude, or a dense signal 
model, a model with no large parameters and very many small non-zero parameters. 
We consider a generalization of these two basic models, termed here a “sparse+dense” 
model, in which the signal is given by the sum of a sparse signal and a dense signal. 
Such a structure poses problems for traditional sparse estimators, such as the lasso, 
and for traditional dense estimation methods, such as ridge estimation. We propose a 
new penalization-based method, called lava, which is computationally efficient. With 
suitable choices of penalty parameters, the proposed method strictly dominates both 
lasso and ridge. We derive analytic expressions for the finite-sample risk function of 
the lava estimator in the Gaussian sequence model. We also provide an deviation 
bound for the prediction risk in the Gaussian regression model with fixed design. 
In both cases, we provide Stein’s unbiased estimator for lava’s prediction risk. A 
simulation example compares the performance of lava to lasso, ridge, and elastic net in 
a regression example using feasible, data-dependent penalty parameters and illustrates 
lava’s improved performance relative to these benchmarks. 


Key words: high-dimensional models, penalization, shrinkage, non-sparse signal re¬ 
covery 


1. Introduction 

Many recently proposed high-dimensional modeling techniques build upon the funda¬ 
mental assumption of sparsity. Under sparsity, we can approximate a high-dimensional 
signal or parameter by a sparse vector that has a relatively small number of non-zero com¬ 
ponents. Various tq-based penalization methods, such as the lasso and soft-thresholding, 
have been proposed for signal recovery, prediction, and parameter estimation within a 
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(2009), 

Meinshausen and Yu 

(2009), Wainwright ( 

2009), 
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0), Zhang ( 

2010), i 

ioh and Wainwrighl 

(2013), and others. By virtue 


of being based on t\ -penalized optimization problems, these methods produce sparse 
solutions in which many estimated model parameters are set exactly to zero. 


We are grateful to Garry Chamberlain, Guido Imbens, Anna Mikusheva, Philippe Rigollet for helpful 
discussions. 
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Another commonly used shrinkage method is ridge estimation. Ridge estimation 
differs from the aforementioned ^-penalized approaches in that it does not produce 
a sparse solution but instead provides a solution in which all model parameters are 
estimated to be non-zero. Ridge estimation is thus particularly suitable when the model’s 
parameters or unknown signals contain many very small components, i.e. when the 
model is dense. See, e.g., Hsu et al. (2014). Ridge estimation tends to work better 
than sparse methods whenever a signal is dense in such a way that it can not be well- 
approximated by a sparse signal. 


In practice, we may face environments that have signals or parameters which are nei¬ 
ther dense nor sparse. The main results of this paper provide a model that is appropriate 
for this environment and a corresponding estimation method with good estimation and 
prediction properties. Specifically, we consider models where the signal or parameter, 8, 
is given by the superposition of sparse and dense signals: 


8 = P + ^ • ( 1 . 1 ) 

dense part s P arse P art 

Here, <5 is a sparse vector that has a relatively small number of large entries, and (3 is 
a dense vector having possibly very many small, non-zero entries. Traditional sparse 
estimation methods, such as lasso, and traditional dense estimation methods, such as 
ridge, are tailor-made to handle respectively sparse signals and dense signals. However, 
the model for 8 given above is “sparse+dense” and cannot be well-approximated by 
either a “dense only” or “sparse only” model. Thus, traditional methods designed for 
either sparse or dense settings are not optimal within the present context. 

Motivated by this signal structure, we propose a new estimation method, called “lava”. 
Let K(data, 8) be a general statistical loss function that depends on unknown parameter 
8, and let p be the dimension of 8. To estimate 8, we propose the “lava” estimator given 
by 

9^ = P + 8 ( 1 . 2 ) 

where /3 and 5 solve the following penalized optimization problem: 

0,6) = arg min |f(data, ft + 8) + \ 2 WWl + Ai||5||i). (1.3) 


In the formulation of the problem, A 2 and Ai are tuning parameters corresponding to the 
I 2 - and l\- penalties which are respectively applied to the dense part of the parameter, 
/?, and the sparse part of the parameter, 5. The resulting estimator is then the sum 
of a dense and a sparse estimator. Note that the separate identification of (3 and 5 
is not required in (1.1), and the lava estimator is designed to automatically recover 
the combination (3 + <5 that leads to the optimal prediction of (3 + <5. Moreover, under 
standard conditions for ^i-optimization, the lava solution exists and is unique. In naming 
the proposal “lava”, we emphasize that it is able, or at least aims, to capture or wipe 
out both sparse and dense signals. 

The lava estimator admits the lasso and ridge shrinkage methods as two extreme cases 
by respectively setting either A 2 = 00 or Ai = 00Q In fact, it continuously connects the 


1 With Ai = 00 or A 2 = 00 , we set Ai||d|| 1 = 0 when 5 = 0 or A 2 1|/3 ||2 = 0 when /3 = 0 so the problem 
is well-defined. 
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two shrinkage functions in a way that guarantees it will never produce a sparse solution 
when A 2 < 00 . Of course, sparsity is not a requirement for making good predictions. 
By construction, lava’s prediction risk is less than or equal to the prediction risk of the 
lasso and ridge methods with oracle choice of penalty levels for ridge, lasso, and lava; 
see Figure [lj Lava also tends to perform no worse than, and often performs significantly 
better than, ridge or lasso with penalty levels chosen by cross-validation; see Figures [4] 
and[H 

Note that our proposal is rather different from the elastic net method, which also 
uses a combination of t\ and £2 penalization. The elastic net penalty function is 0 1 —>• 
A 2 II 0 H 2 + Ai||0||i, and thus the elastic net also includes lasso and ridge as extreme cases 
corresponding to A 2 = 0 and Ai = 0 respectively. In sharp contrast to the lava method, 
the elastic net does not split 6 into a sparse and a dense part and will produce a sparse 
solution as long as Ai > 0. Consequently, the elastic net method can be thought of as 
a sparsity-based method with additional shrinkage by ridge. The elastic net processes 
data very differently from lava (see Figure 2 below) and consequently has very different 
prediction risk behavior (see Figure 1 below). 

We also consider the post-lava estimator which refits the sparse part of the model: 

^post-lava = T (1-4) 

where 8 solves the following penalized optimization problem: 

8 = argmm |^(data, j3 + 8) : 8j = 0, if 8j = Clj. (1.5) 


This estimator removes the shrinkage bias induced by using the i\ penalty in estima¬ 
tion of the sparse part of the signal. Removing this bias sometimes results in further 
improvements of lava’s risk properties. 


We provide several theoretical and computational results about lava in this paper. 
First, we provide analytic expressions for the finite-sample risk function of the lava esti¬ 
mator as well as for other methods in the Gaussian sequence model and in a fixed design 
regression model with Gaussian errors. Within this context, we exhibit “sparse+dense” 
examples where lava significantly outperforms both lasso and ridge. Stein’s unbiased 
risk estimation plays a central role in our theoretical analysis, and we thus derive Stein’s 
unbiased risk estimator (SURE) for lava. We also characterize lava’s “Efron’s” degrees 
of freedom (Efron (2004)). Second, we give deviation bounds for the prediction risk of 
the lava estimator in regression models akin to those derived by Bickel et al. (2009) for 
lasso. Third, we illustrate lava’s performance relative to lasso, ridge, and elastic net 
through simulation experiments using penalty levels chosen via either minimizing the 
SURE or by k-fold cross-validation for all estimators. In our simulations, lava outper¬ 
forms lasso and ridge in terms of prediction error over a wide range of regression models 
with coefficients that vary from having a rather sparse structure to having a very dense 
structure. When the model is very sparse, lava performs as well as lasso and outperforms 
ridge substantially. As the model becomes more dense in the sense of having the size of 
the “many small coefficients” increase, lava outperforms lasso and performs just as well 
as ridge. This is consistent with our theoretical results. 
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Theoretical risk 



Risk relative to lava 



FIGURE 1. Exact risk and relative risk functions of lava, post-lava, ridge, lasso, elas¬ 
tic net, and maximum likelihood in the Gaussian sequence model with “sparse+dense” 
signal structure, using the oracle (risk minimizing) choices of penalty levels. See Sec¬ 
tion 2.5 for the description of the model. The size of “small coefficients” is shown on 
the horizontal axis. The size of these coefficients directly corresponds to the size of the 
“dense part” of the signal, with zero corresponding to the exactly sparse case. Relative 
risk plots the ratio of the risk of each estimator to the lava risk. Note that the relative 
risk plot is over a smaller set of sizes of small coefficients to accentuate comparisons 
over the region where there are the most interesting differences between the estimators. 
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We conclude the introduction by noting that our proposed approach complements 
other recent approaches to structured sparsity problems such as those considered in 


fused sparsity estimation (Tibshirani et al. 

(2005) and Chen and Dalalyan 

(2012)) and 

structured matrix estimation problems ( 

Candes et al. 

(2011), Chandrasekaran et al. 

(2011) 

Fan et al. 

(2013 

), and 

Klopp et al. 

(2014)). The latter line of research studied 


estimation of matrices that can be written as low rank plus sparse matrices. Our new 
results are related to but are sharply different from this latter line of work since our 
focus is on regression problems. Specifically, our chief objects of interest are regression 
coefficients along with the associated regression function and predictions of the outcome 
variable. Thus, the target statistical applications of our developed methods include pre¬ 
diction, classification, curve-fitting, and supervised learning. Another noteworthy point 
is that it is impossible to recover the “dense” and “sparse” components separately within 
our framework; instead, we recover the sum of the two components. By contrast, it is 
possible to recover the low-rank component of the matrix separately from the sparse 
part in some of the structured matrix estimation problems. This distinction serves to 
highlight the difference between structured matrix estimation problems and the frame¬ 
work discussed in this paper. Due to these differences, the mathematical side of our 
analysis needs to address a completely different set of issues than are addressed in the 
aforementioned structured matrix estimation problems. 

We organize the rest of the paper as follows. Section 2 defines the lava shrinkage 
estimator in a canonical Gaussian sequence model, and derives its theoretical risk func¬ 
tion. Section 3 defines and analyzes the lava estimator in the regression model. Section 
4 provides computational examples, and Section 5 concludes. We give all proofs in the 
appendix. 

Notation. The notation a n < b n means that a n < Cb n for all n, for some constant C 
that does not depend on n. The £2 and £\ norms are denoted by || • ||2 (or simply || • ||) and 
|| • || 1 , respectively. The £q- “norm”, || • ||o, denotes the number of non-zero components of a 
vector, and the ||.||oo norm denotes a vector’s maximum absolute element. When applied 
to a matrix, || • || denotes the operator norm. We use the notation a V b = max(a, b ) and 
a Ab = min(a, b). We use x' to denote the transpose of a column vector x. 

2. The lava estimator in a canonical model 

2.1. The one dimensional case. Consider the simple problem where a scalar random 
variable is given by 

Z = 9 + e, e ~ 1V(0, a 2 ). 

We observe a realization z of Z and wish to estimate 9. Estimation will often involve the 
use of regularization or shrinkage via penalization to process input z into output d(z), 
where the map z H > d(z ) is commonly referred to as the shrinkage (or decision) function. 
A generic shrinkage estimator then takes the form 9 = d(Z). 

The commonly used lasso method uses G-penalization and gives rise to the lasso or 
soft-thresholding shrinkage function: 

diasso(~) = argmin |(z - Q) 2 + A;|0|} = (|z| - A;/ 2 ) + sign( 2 :), 
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where y + := max(y, 0) and A; > 0 is a penalty level. The use of the ^-penalty in place 
of the t\ penalty yields the ridge shrinkage function: 

z 


bridge (z) = argmin \(z - 9) 2 + A r |6>| 2 ) 


1 + A r 

where A r > 0 is a penalty level. The lasso and ridge estimators then take the form 

d]a.sso — d \asso (Z), bridge — bridge (^)- 

Other commonly used shrinkage methods include the elastic-net (Zou and Hastie (2005)), 
which uses 9 ha A 2 10| 2 + Ai |6>| as the penalty function; hard-thresholding; and the SCAD 
(Fan and Li (2001)), which uses a non-concave penalty function. 

Motivated by points made in the introduction, we proceed differently. We decompose 
the signal into two components 

9 = P + 5, 

and use the different penalty functions - the ^2 and i\ - for each component in order to 
predict 9 better. We thus consider the penalty function 

(/3,5) ha A 2 |/3| 2 + A x |<5|, 

and introduce the “lava” shrinkage function z ha d\ &Y& {z) defined by 

du V a(z) ■= d 2 (z) + d 1 (z), 

where d\(z) and d 2 (z) solve the following penalized prediction problem: 

(d 2 (z),di(z)) := arg min Uz - P - 5} 2 + X 2 \P\ 2 + Ai|<5|). 

(/3,<5)eK 2 t > 

Although the decomposition 9 = P + 5 is not unique, the optimization problem ( |2.2| ) has 
a unique solution for any given (Ai, A 2 ). The proposal thus defines the lava estimator of 
9: 

$lava = ^lava(^)- 

For large signals such that |*| > Xi/(2k), lava has the same bias as the lasso. This 
bias can be removed through the use of the post-lava estimator 

^post—lava — Q^post—lava(^) i 

where dpost-iava(£) := d 2 (z) + d\(z), and d\(z) solves the following penalized prediction 
problem: 


( 2 . 1 ) 

( 2 . 2 ) 


d\{z) := argmin j [z — d 2 (z) — d>] 2 : 5 = 0 if d±(z) = o|. 


(2.3) 

The removal of this bias will result in improved risk performance relative to the original 
estimator in some contexts. 


From the Karush-Kuhn-Tucker conditions, we obtain the explicit solution to (2.1). 
Lemma 2.1. For given penalty levels X± > 0 and X 2 > 0: 


- k)z + k(\ 

z\ - Ai/(2fc)) + sign(z) 

(2.4) 

z - Ai/2, 

z > Ai/(2 k) 


(1 - k)z, 

—Ai/(2 k) <z< Ai/(2 k) 

(2.5) 

z + Ai/2, 

z < —Ai/(2 k) 
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where k := 1+ 2 ^ . The post-lava shrinkage function is given by 


dpost-lava. (~ ) 


z, \z\ > Xi/(2k), 

(1 — k)z, \z\ < Xi/(2k). 


Figure [2] plots the lava shrinkage function along with various alternative shrinkage 
functions for z > 0. The top panel of the figure compares lava shrinkage to ridge, lasso, 
and elastic net shrinkage. It is clear from the figure that lava shrinkage is different from 
lasso, ridge, and elastic net shrinkage. The figure also illustrates how lava provides a 
bridge between lasso and ridge, with the lava shrinkage function coinciding with the 
ridge shrinkage function for small values of the input 2 and coinciding with the lasso 
shrinkage function for larger values of the input. Specifically, we see that the lava 
shrinkage function is a combination of lasso and ridge shrinkage that corresponds to 
using whichever of the lasso or ridge shrinkage is closer to the 45 degree line. 


It is also useful to consider how lava and post-lava compare with the post-lasso or 
hard-thresholding shrinkage: dpost-lasso^) = ~1{M > A//2}. These different shrinkage 
functions are illustrated in the bottom panel of Figure [2} 


From (2.4), we observe some key characteristics of the lava shrinkage function: 


1) The lava shrinkage admits the lasso and ridge shrinkages as two extreme cases. 
The lava and lasso shrinkage functions are the same when A 2 = 00 , and the ridge and 
lava shrinkage functions coincide if Ai = 00 . 

2) The lava shrinkage function d\ ava (z) is a weighted average of data z and the lasso 
shrinkage function diasso(^) with weights given by 1 — k and k. 

3) The lava never produces a sparse solution when A 2 < 00 : If A 2 < 00 , di ava (z) = 0 
if and only if z = 0. This behavior is strongly different from elastic net which produces 
a sparse solution as long as Ai > 0. 

4) The lava shrinkage function continuously connects the ridge shrinkage function 
and the lasso shrinkage function. When \z\ < Ai/(2/c), lava shrinkage is equal to ridge 
shrinkage; and when \z\ > X\/(2k), lava shrinkage is equal to lasso shrinkage. 

5) The lava shrinkage does exactly the opposite of the elastic net shrinkage. The elastic 
net shrinkage function coincides with the lasso shrinkage function when \z\ < Xi/(2k); 
and when \z\ > Ai/(2 k), the elastic net shrinkage is the same as ridge shrinkage. 


2.2. The risk function of the lava estimator in the one dimensional case. In 

the one-dimensional case with Z ~ N(8,cr 2 ), a natural measure of the risk of a given 
estimator 8 = d(Z ) is given by 

R(8,8) = E [d{Z)-8] 2 

= -a 2 + E(Z-d{Z)) 2 + 2E[(Z-8)d{Z)}. (2.6) 

Let P q & denote the probability law of Z. Let <fe,a be the density function of Z. We 
provide the risk functions of lava and post-lava in the following theorem. We also present 
the risk functions of ridge, elastic net, lasso, and post-lasso for comparison. 

Theorem 2.1 (Risk Function of Lava and Related Estimators in the Scalar Case). 
Suppose Z ~ N(8,a 2 ). Then for w = Ap/(2 k), k = A 2 /(l + A 2 ), h = 1/(1 + A 2 ), 
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FIGURE 2. Shrinkage functions. Here we plot shrinkage functions implied by lava 
and various commonly used penalized estimators. These shrinkage functions cor¬ 
respond to the case where penalty parameters are set as A 2 = A r = 1/2 and 
Ai = A; = 1/2. In each figure, the light blue dashed line provides the 45 degree 
line coinciding to no shrinkage. 


— Ai/(2(1 + A 2 )) — 0 and g = Ai/(2(1 + A 2 )) — 0, we have 

R{9, 0iava) = ~k 2 (w + 9)(j) e ,a(w)a 2 + k 2 (0 - w)4>e,a(-w)a 2 

+(A?/4 + <J 2 )?eA\Z\ >A + (0 2 k 2 + (1 - fc)V) PoA\Z\ < 

R{9, ^post-lava) = (J 2 [—k 2 w + 2 kw - k 2 9}(j) e A w ) + (y 2 [—k 2 w + 2 kw + k 2 9]4>o ta (-w) 

+a 2 P e A\ z \ > w ) + (k 2 0 2 + (1 " k) 2 a 2 )P e A\Z\ < w), 

R{9, 0iasso) = -(Az/2 + 9)cj)o,a{\i/2)cr 2 + (9 - Ai/2)</> e , CT (-V2)a 2 
+(A 2 /4 + a 2 )P e A\Z\ > Az/2) + 0 2 P e A\ Z \ < V2), 
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R(9, ^post-iasso) — (Az/2 — 9)4>Q^(\i/2)a~ + (A;/2 + 9)(f>Q >a (—\i/2)a 2 

+a 2 P e A\ z \ > A/2\+ 0 2 P e A\ z \ < A/2), 

R(9 , bridge) = 9 2 k 2 + (1 - fc)V, k = A/0- + A r ), 

R{9, ^elastic net) = A(h 2 \l/2 + h 2 Q + 2dll)(j)Q )(T (A/2) 

-a 2 (-h 2 A/2 + h 2 9 + 2gh)(t>eA-A/2) + 9 2 P e A\ z \ < A/2 ) 

+{{h9 + d) 2 + h 2 a 2 )P e A z > A/2 ) 

+((h9 + g) 2 + h 2 a~)PQ jCT (Z < — Ai/2). 

These results for the one-dimensional case provide a key building block for results in 
the multidimensional case provided below. In particular, we build from these results 
to show that the lava estimator performs very favorably relative to, and can substan¬ 
tially dominate, the maximum likelihood estimator, the ridge estimator, and i'l-based 
estimators (such as lasso and elastic-net) in interesting multidimensional settings. 


2.3. Multidimensional case. We consider now the canonical Gaussian model or the 
Gaussian sequence model. In this case, we have that 

Z ~N p (9, a 2 I p ) 


is a single observation from a multivariate normal distribution where 9 = (9i,...,9 P Y 
is a p-dimensional vector. A fundamental result for this model is that the maximum 
likelihood estimator Z is inadmissible and can be dominated by the ridge estimator and 
related shrinkage procedures when p > 3 (e.g. Stein (1956)). 

In this model, the lava estimator is given by 


Stein (1956 


$lava •— (^lava,l> $lava,p) •— ( d\ ava { Z \ ), ■ • ■ ■ 0 ava ( Z p ) ) , 


where d^AO is the lava shrinkage function as in 
capture the case where 

9= f3 + JA 

dense part s P arse P art 

is formed by combining a sparse vector 5 that has a relatively small number of non-zero 
entries which are all large in magnitude and a dense vector /3 that may contain very 
many small non-zero entries. This model for 9 is “sparse+dense.” It includes cases that 
are not well-approximated by “sparse” models - models in which a very small number 
of parameters are large and the rest are zero - or by “dense” models - models in which 
very many coefficients are non-zero but all coefficients are of similar magnitude. This 
structure thus includes cases that pose challenges for estimators such as the lasso and 
elastic net that are designed for sparse models and for estimators such as ridge that are 
designed for dense models. 


(2.5). The estimator is designed to 


Remark 2.1. The regression model with Gaussian noise and an orthonormal design is 
a special case of the multidimensional canonical model. Consider 

Y = X9 + U, U \X~ N{ 0, a 2 u I n ), 

where Y and U are n x 1 random vectors and X is an n x p random or fixed matrix, 
with n and p respectively denoting the sample size and the dimension of 9. Suppose 
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-X'X = I p a.s.. with p < n. Then we have the canonical multidimensional model: 


11 Z 

Z = 6 + e , Z =-X'Y, e = —X'U ~ N(0,a 2 I p ), a 2 = ^. 

n n n 


All of the shrinkage estimators discussed in Section 2.1 generalize to the multidimen¬ 
sional case in the same way as lava. Let z H > d e (z) be the shrinkage function associated 
with estimator e in the one dimensional setting where e can take values in the set 

£ = {lava,post-lava, ridge, lasso, post-lasso, elastic net}. 

We then have a similar estimator in the multidimensional case given by 

e e := (e e , h ...,9 e ,pY ■= (d e (Z 1 ),...,d e (Z p )y. 

The risk calculation from the one dimensional case then caries over to the multidimen¬ 
sional case since 

p 

R (0, e e ) := E||0 - 6e III = Y, Kil 

3 = 1 

Given this fact, we immediately obtain the following result. 


Theorem 2.2 (Risk Function of Lava and Related Estimators in the Multi-Dimensional 
Case). If Z ~ A r (0, o 2 I p ), then for any e E £ we have that 


R(Me) = i>(0j,0ej), 

3 =1 


where R(-,-) is the uni-dimensional risk function characterized in Theorem 2.1 


These risk functions are illustrated in Figure 1 in a prototypical “sparse+dense” model 
generated according to the model discussed in detail in Section 2.5. The tuning param¬ 
eters used in this figure are the best possible (risk minimizing or oracle) choices of the 
penalty levels found by minimizing the risk expression given in Theorem |2.2| 


2.4. Canonical plug-in choice of penalty levels. We now discuss simple, rule-of- 
thumb choices for the penalty levels for lasso (A;), ridge (A r ) and lava (Ai,A 2 ). In the 
Gaussian model, a canonical choice of A/ is 


A i = 2u4* R1 - c/(2 p)), 


which satisfies 


P ( max \Z,; — 6j I 
V 3<P 


< Xi/2 >1 - c; 


see, e.g., Donoho and Johnstone (1995). Here <&(•) denotes the standard normal cumu¬ 
lative distribution function, and c is a pre-determined significance level which is often 
set to 0.05. The risk function for ridge is simple, and an analytic solution to the risk 
minimizing choice of ridge tuning parameter is given by 


A, ■ — (7 
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As for the tuning parameters for lava, recall that the lava estimator in the Gaussian 
model is 


01ava 


(Pj ) $j ) 


(^lava,l> fila.va./)) > @lava,j — “I - j — lj • •• -P- 

arg ,« ™\ n D2 ( Z i “ & - 5 i) 2 + a 2II 2 + Xi\8j\. 

(Pi,5j)e R 2 


If the dense component ft were 
would suggest setting 


known, then following Donoho and Johnstone (1995) 


Ai = 2cr& 1 (l — c/(2p)) 

as a canonical choice of Ai for estimating 5. If the sparse component 5 were known, one 
could adopt 


a 2 = <7 2 (p/||/?||!) 


as a choice of A 2 for estimating /3 following the logic for the standard ridge estimator. 


We refer to these choices as the “canonical plug-in” tuning parameters and use them 
in constructing the risk comparisons in the following subsection. We note that the lasso 
choice is motivated by a sparse model and does not naturally adapt to or make use of 
the true structure of 6. The ridge penalty choice is explicitly tied to risk minimization 
and relies on using knowledge of the true 6. The lava choices for the parameters on the 
l\ and I 2 penalties are, as noted immediately above, motivated by the respective choices 
in lasso and ridge. As such, the motivations and feasibility of these canonical choices 
are not identical across methods, and the risk comparisons in the following subsection 
should be interpreted within this light. 


2.5. Some risk comparisons in a canonical Gaussian model. To compare the risk 
functions of lava, lasso, and ridge estimators, we consider a canonical Gaussian model, 
where 

0i=3, 0j = O.lg, j = 2,...,p, 

for some q > 0. We set the noise level to be a 2 = 0.1 2 . The parameter 6 can be 
decomposed as 6 = f3 + 5, where the sparse component is 6 = (3,0,..., 0)', and the dense 
component is 

P = (0,0.lg,..., O.lg)', 

where q describes the “size of small coefficients.” The canonical tuning parameters are 
A; = Ai = 2crT> —1 (1 — c/(2p)), A r = a 2 p/ (3 + 0.1 2 q 2 (p — 1)) and A 2 = a 2 p/(0.1 2 q 2 (p—l)). 

Figure [I] (given in the introduction) compares risks of lava, lasso, ridge, elastic net, 
and the maximum likelihood estimators as functions of the size of the small coefficients 
q , using the ideal (risk minimizing or oracle choices) of the penalty levels. Figure [ 3 ] 
compares risks of lava, lasso, ridge and the maximum likelihood estimators using the 
“canonical plug-in” penalty levels discussed above. Theoretical risks are plotted as a 
function of the size of the small coefficients q. We see from these figures that regardless 
of how we choose the penalty levels - ideally or via the plug-in rules - lava strictly 
dominates the competing methods in this “sparse+dense” model. Compared to lasso, 
the proposed lava estimator does about as well as lasso when the signal is sparse and does 
significantly better than lasso when the signal is non-sparse. Compared to ridge, the 
lava estimator does about as well as ridge when the signal is dense and does significantly 
better than ridge when the signal is sparse. 
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Theoretical risk 
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FIGURE 3. Exact risk functions of lava, post-lava, ridge, lasso, and maximum like¬ 
lihood in the Gaussian sequence model with “sparse+dense” signal structure, using 
the canonical “plug-in” choices of penalty levels. See Section 2.5 for the description 
of penalty levels and the model. The size of “small coefficients” is shown on the hori¬ 
zontal axis. The size of these coefficients directly corresponds to the size of the “dense 
part” of the signal, with zero corresponding to the exactly sparse case. Relative risk 
plots the ratio of the risk of each estimator to the lava risk, R(6f 0 e )/R($, ft ava ). Note 
that the relative risk plot is over a smaller set of sizes to accentuate comparisons over 
the region where there are the most interesting differences between the estimators. 


In Section 5 we further explore the use of feasible, data-driven choices of penalty 
levels via cross-validation and SURE minimization; see Figures [4] and 5. We do so in 
the context of the Gaussian regression model with fixed regressors. With either cross- 
validation or SURE minmization, the ranking of the estimators remains unchanged, with 
lava consistently dominating lasso, ridge, and the elastic net. 
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Stein (1956) proved that a ridge estimator strictly dominates maximum likelihood in 


the Gaussian sequence model once p > 3. In the comparisons above, we also see that the 
lava estimator strictly dominates the maximum likelihood estimator; and one wonders 
whether this domination has a theoretical underpinning similar to Stein’s result for ridge. 
The following result provides some (partial) support for this phenomenon for the lava 
estimator with the plug-in penalty levels. The result shows that, for a sufficiently large 
n and p, lava does indeed uniformly dominate the maximum likelihood estimator on the 
compact set {9 = (3 + 5 : ||/3||oc + Halloo < M}. 


Lemma 2.2 (Relative Risk of Lava vs. Maximum Likelihood ). Suppose Z N p (9, a 2 I p ), 
where 6 can he decomposed into 9 = /3 + 5 with s = 1{<^ / 0} < p. Let 

Ai and X 2 be chosen with the plug-in rule given in Section 2-4- Then uniformly for 
9 E {9 = (5 + 6 : ||0||oo + Halloo < Tf}, when a^/logp > 2 M + 33cr, M 2 logp > 16cr 2 , and 
7rc 2 log p > 1, we have 


E|| 9uUZ) ~ eg M\l SsM 2 4 / 7M \ 

n\Z-9\\l - a 2 p+\\/3\\ 2 ^ pa* + V^P 1 / 16 V ap 1 / 16 ) ' 

Remark 2.2. Note that 

p2 _ WWl 
d ° 2 p+ mi 

measures the proportion of the total variation of Z — 6 around 0 that is explained by 
the dense part of the signal. If is bounded away from 1 and M and a 2 > 0 are 
fixed, then the risk of lava becomes uniformly smaller than the risk of the maximum 
likelihood estimator on a compact parameter space as p —> 00 and s/p — > 0. Indeed, if 
R? d is bounded away from 1, then 


3 sM 2 4 

per 2 + v^P 1 / 16 


/ 7 M \ 

V dp 1 / 16 J 


RR = R? d + o(l) < 1. 


Moreover, we have RR —> 0 if R d —> 0. That is, the lava estimator becomes infinitely 
more asymptotically efficient than the maximum likelihood estimator in terms of relative 
risk. ■ 


2.6. Stein’s unbiased risk estimation for lava. Stein (1981) proposed a useful risk 


estimate based on the integration by parts formula, now commonly referred to as Stein’s 
unbiased risk estimate (SURE). This subsection derives SURE for the lava shrinkage in 
the multivariate Gaussian model. 

Note that 


Ell^lava - e\\l = -pa 2 + E|| Z- e X avail! + 2E[(Z - 9)%^]. 


(2.7) 


An essential component to understanding the risk is given by applying Stein’s formula 
to calculate E [(Z — 9)'9\ &va \. A closed-form expression for this expression in the one¬ 


dimensional case is given in equation (A.5) in the appendix. The following result provides 


the SURE for lava in the more general multidimensional case. 
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Theorem 2.3 (SURE for lava). Suppose Z = (Zi ,..., Z p )' ~ N p (6, u 2 I p ). Then 

p 

E[(Z - 0)'0 lava ] = p( 1 - k)a 2 + ka 2 ^ P^d^l > X 1 /(2k)). 

3 = 1 

In addition, let {Zij}f = 1 be identically distributed as Zj for each j. Then 

- n - n p 

R(0, 0 la va.) = (1 — 2k)p<7 2 + — "y ' || Zi — dlava(-^i) ||| + 2ko~ — EE mz^ > \i/(2k)}. 

i =1 2=1 j =1 

is an unbiased estimator o/R(0,6 , i ava ). 


3. Lava in the Regression Model 

3.1. Definition of Lava in the Regression Model. Consider a fixed design regression 
model: 

Y = xe 0 + U, U~N(0,oll n ), 

where Y = (y \,..., y n )', X = (X\,X n )' , and 9q is the true regression coefficient. 
Following the previous discussion, we assume that 

= A) + S 0 

is “sparse+dense” with sparse component do and dense component f3 o- Again, this coef¬ 
ficient structure includes cases which cannot be well-approximated by traditional sparse 
models or traditional dense models and will pose challenges for estimation strategies 
tailored to sparse settings, such as lasso and similar methods, or strategies tailored to 
dense settings, such as ridge. 

In order to define the estimator we shall rely on the normalization condition that 

n~ 1 [X'X\ jj = 1, j = l,...,p. (3.1) 

Note that without this normalization, the penalty terms below would have to be modified 
in order to insure equivariance of the estimator to changes of scale in the columns of X. 

The lava estimator #i ava of 8q solves the following optimization problem: 

f^lava • @ T S, 

0,S) = axg^min R2p |i ||Y - X+ 6)\\ 2 2 + \ 2 M\ 2 2 + \iW6\h\. (3.2) 

The lava program splits parameter 8 into the sum of fl and 5 and penalizes these two 
parts using the 1 2 and t\ penalties. Thus, the t\- penalization regularizes the estimator of 
the sparse part do of 80 and produces a sparse solution d. The ^-penalization regularizes 
the estimator of the dense part fto of 80 and produces a dense solution ( 3 . The resulting 
estimator of #0 is then simply the sum of the sparse estimator d and the dense estimator 

P- 
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3.2. A Key Profile Characterization and Some Insights. The lava estimator can 
be computed in the following way. For a fixed 5, we minimize 

/3(<5) = arg min j * ||K -X(J3 + <5)||| + A 2 ||/3||| j , 

with respect to f3. This program is simply the well-known ridge regression problem, and 
the solution is 

TO = (X'X + n\ 2 I p )- 1 X'(Y - XS). 

By substituting (3 = /3(d) into the objective function, we then define an ^-penalized 
quadratic program which we can solve for 6: 

6 = arg mm |i||T - X0(8) + 8)g + A 2 ||tolll + Ai||<5||ij . (3.3) 

The lava solution is then given by 6 = (3(5) + 5. The following result provides a useful 
characterization of the solution. 


Theorem 3.1 (A Key Characterization of the Profiled Lava Program). Define ridge- 
projection matrices, 


P\ 2 = X(X'X + n\ 2 Ip)- l X' and K Aa = I n - P Aa , 
and transformed data, 

Y = K 1 / 2 Y and X = K 1 / 2 X. 

A2 A2 


Then 



5 = arg min < 

[-||y-A5||l + A 1 ||<y|| 1 l 

(3.4) 


(5eiRp | 

l n J 

and 

A0i ava 

= P Aa Y + K A2 A 5. 

(3.5) 


The theorem shows that solving for the sparse part 5 of the lava estimator is equivalent 
to solving for the parameter in a standard lasso problem using transformed data. This 
result is key to both computation and our theoretical analysis of the estimator. 


Remark 3.1 (Insights derived from Theorem 3.1). Suppose Jo were known. Let W = 
Y — X5q be the response vector after removing the sparse signal, and note that we 
equivalently have W = X/3q + U. A natural estimator for /3q in this setting is then the 
ridge estimator of W on A: 

to) = (X'X + nX 2 I p )~ 1 X'W. 


Denote the prediction error based on this ridge estimator as 

D ri dge(A 2 ) = A to) - Xl3o = ~ K A2 X(3o + Pa 2 U. 


Under mild regularity conditions on /3q and the design matrix, Hsu et al. (2014) showed 
that 


(l|D, Ms „(A 2 )|| 2 = op(l). 

Using Theorem |3.1| , the prediction error of lava can be written as 
toava - Xd 0 = P A2 Y + K A2 A? - A A) - A«5 0 = D ridge (A 2 ) + K Aa A (5 - So). (3.6) 
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Hence, lava has vanishing prediction error as long as 

1 


n 


K X2 X(5-S 0 )\\i = op(l). 


(3.7) 


Condition (3.7) is related to the performance of the lasso in the transformed problem 


(3.4). Examination of (3.4) shows that it corresponds to a sparse regression model with 


T7 T > ~ 1/2 

approximation errors X/3q: For U = K A ' 2 U, 

Y = XS 0 + U + K^ 2 X,3 0 . 


(3.8) 


Under conditions such as those given in Hsu et al. (2014), the approximation error obeys 

(3.9) 


i||K^ *A,||l = 0p (l). 


It is also known that the lasso estimator performs well in sparse models with vanishing 
approximation errors. The lasso estimator attains rates of convergence in the prediction 
norm that are the sum of the usual rate of convergence in the case without approximation 
errors and the rate at which the approximation error vanishes; see 


e-g., 


Belloni and 


Chernozhukov (2013). Thus, we anticipate that (3.7) will hold. 


To help understand the plausibility of condition (3.9), consider an orthogonal design 


where hX'X = L 


1/2 

p . In this case, it is straightforward to verify that K^' = K,\* where 


A£ = \fXi/ (\7l + A 2 — 1 /A 2 )• Hence, X/3o = K^* Xfio is a component of the prediction 
bias from a ridge estimator with tuning parameter A?) and is stochastically negligible. 
We present the rigorous asymptotic analysis for the general case in Section 3.5. ■ 

3.3. Degrees of Freedom and SURE. Degrees of freedom is often used to quantify 
model complexity and to construct adaptive model selection criteria for selecting tuning 

with a fixed 


parameters. In a Gaussian linear regression model Y ~ N(X0Q,a 
design, we can define the degrees of freedom of the mean fit X6 to be 

df(0) = \e[(y - xe 0 yxe}- 


2 Jn) 


see, e.g., Efron ([2004). Note that this quantity is also an important component of the 
mean squared prediction risk: 

E-\\Xe-X0 o \\l = -ol + E-\\Xe-Y\\l + ^df(0). 
n n " n 


Stein (1981 )’s SURE theory provides a tractable way of deriving an unbiased estimator 


of the degrees of freedom, and thus the mean squared prediction risk. Specifically, write 
9 = d(Y,X) as a function of Y, conditional on X. Suppose d(-,X ) : M n — > M p is 
almost differentiable; see Meyer and Woodroofe (2000) and Efron et al. ( 2004| )). For 
/ : W 1 —> R n differentiable at y , define 

d y f(y) ■= [dfij(y )], (i,j) € {1,..., n} 2 , dfij(y ) := /,(//)- 


V y • f(y) := tr (d y f(y)). 
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Let X- denote the i-th row of X, i = 1, n. 


Then, from Steinj (1981), we have that 


4 E [(y - Xe o yXd(Y,X)} = E[V y • (Xd{Y,X))\ = tr (d y [Xd(Y,X)]). 
T/ 


An unbiased estimator of the term on the right-hand-side of the display may then be 
constructed using its sample analog. 

In this subsection, we derive the degrees of freedom of the lava, and thus a SURE of 
its mean squared prediction risk. By Theorem |3.1| 

V y • (Ad lava (y, X)) = tr(P Aa ) + V y • (K Aa Ad lasso (K^ 2 y, K^ 2 X)) (3.10) 

= tr(P Aa ) + tr (k X2 <9 y [Ad lasso (K A (/y, A)]) , (3.11) 


1/2 1 /2 

where di ava (y, X) is the lava estimator on the data {y,X) and diasso(K A ' y, K A ( A")) is 

1 /o 1 /o 

the lasso estimator on the data (K A ' 2 y, K A ( A) with the penalty level Ai. The almost 

1/2 1 /2 

differentiability of the map y H > d\ asso (K A y, K A ' A) follows from the almost differen- 

1 /2 

tiability of the map u (-a di asS o(«, K A ( A ), which holds by the results in 
(2011) and Tibshirani and Taylor (2012). 


Dossal et al. 


The following theorem presents the degrees of freedom and SURE for lava. Let J = 
{j < p : 5j fz 0 } be the active set of the sparse component estimator with cardinality 
denoted by \J\. Recall that A = K Aa A. Let A j be an n x \J\ submatrix of A whose 

columns are those corresponding to the entries in J. Let A~ denote the Moore-Penrose 
pseudo-inverse of a square matrix A. 


Theorem 3.2 (SURE for Lava in Regression). Suppose Y N(Xe„,oll n ). Let 

K j = I-X j {.X'jXj)-X' } 

be the projection matrix onto the unselected columns of the transformed variables. We 
have that 

df(0iava) = E[rank(Aj) + tr(Kj P Aa )]. 

Therefore, the SURE of E^ | A#i ava — A6*o 111 given by 

1 ^ 9rr 2 ~ 9 rr 2 ~ 

-a 2 u + —1|A0 lava - U||2 + ^ r ank(A f) + —^tr(K ; P Aa ). 
n n J n J 


3.4. Post-lava in regression. We can also remove the shrinkage bias in the sparse 
component introduced by the Id-penalization via a post-selection procedure. Specifically, 
let (/3, 5) respectively denote the lava estimator of the dense and sparse components. 
Define the post-lava estimator as follows: 

^post-lava ft T d, 

6 = arg min { — IIY" — A/3 — A<5||o : <5, = 0 if 5j = 0 

Let Xj be an n x | J\ submatrix of A whose columns are selected by J . Then we can 
partition 5 = (Sj, 0)', where 5j = (X'jX j)~ X’j(Y - X$). Write Pj = Xj(X' } Xj)~X'j 
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and Kj = I n — P j. The post-lava prediction for X0 is: 

^post-lava = PjY + KjXj3. 

In addition, note that the lava estimator satisfies X/3 = P Aa (Y — X5). We then have the 
following expression of -A# post _i ava . 

Lemma 3.1. Let U .— Y -^f^iava* Then X ^post-lava — ARi ava ~P PjU• 


The above lemma reveals that the post-lava corrects the Id-shrinkage bias of the 
original lava fit by adding the projection of the lava residual onto the subspace of the 
selected regressors. This correction is in the same spirit as the post-lasso correction for 
shrinkage bias in the standard lasso problem; 


see 


Belloni and Chernozhukov (2013). 


Remark 3.2. We note that the SURE for post-lava may not exist, though an estimate 
of the upper bound of the risk function may be available, because of the impossibility 
results for constructing unbiased estimators for non-differentiable functions; see [Hirano 
and Porter (2012). ■ 


3.5. Deviation Bounds for Prediction Errors. In the following, we develop devia¬ 
tion bounds for the lava prediction error: b||X0i ava — JU 111 - AVe continue to work with 
the decomposition #o = A) + <d) and will show that lava performs well in terms of rates 
on the prediction error in this setting. According to the discussion in Section 3.2, there 
are three sources of prediction error: (i) D r idge(A 2 ), (ii) Xfio and (iii) K\ 2 X(5 — do). 
The behavior of the first two terms is determined by the behavior of the ridge estimator 
of the dense component j5 o, and the behavior of the third term is determined by the 
behavior of the lasso estimator on the transformed data. 


We assume that U ~ N( 0, 0 ^/ n ) and that X is fixed. As in the lasso analysis of Bickel 


et al. (2009), a key quantity is the maximal norm of the score: 


A = 

2 -x'u 

— 

-X' k A9 u 


n 

OO 

n 


Following Belloni and Chernozhukov (2013), we set the penalty level for the lasso part 
of lava in our theoretical development as 


Ai = cAi_ q with Ai_ a = infjZ 6 M : P(A < l) > 1 — a} (3-12) 

and c > 1 a constant. Note that Belloni and Chernozhukov] (2013) suggest setting c = 1.1 
and that Ai_ a is easy to approximate by simulation. 

Let S = X'X/n and V\ 2 be the maximum diagonal element of 

V\ 2 = (S + A 2l P )~ 1 S(S + A 2 I P r 1 Xl 
Then by the union bound and Mill’s inequality: 


■A-l— a ^ -A-l—a: •— 


V\ 2 log(2 p/a) 


n 


(3.13) 


Thus the choice Ai_ a is strictly sharper than the union bound-based, classical choice 
Ai_ a . Indeed, Ai_ q is strictly smaller than Ai_ q even in orthogonal design cases since 
union bounds are not sharp. In collinear or highly-correlated designs, it is easy to give 
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examples where Ai_ a = o(Ai_ a ); see Belloni et al. (2014). Thus, the gains from using 
the more refined choice can be substantial. 

~ 1/2 

We define the following design impact factor: For X = K A ' 2 X, 


l(c, Sq, Ai, A 2 ) := 


inf 


\XA\ 


n 


Ae7e(c,5 0 Ai,A 2 ) IIJolli — II ^0 + A||i + c 1 1 |A|| 1 


where TZ(c, 5 0 , Ai, A 2 ) = {A e M p \{ 0 } : \\XA\\%/n < 2A 1 (||<5 0 ||i — ||<5 0 + A||i + c 1 || A||i)} 
is the restricted set, and where i(c, Sq, Ai, A 2 ) := 00 if Sq = 0 . 


The design impact factor generalizes the restricted eigenvalues of Bickel et al. (2009) 


and and is tailored for bounding estimation errors in the prediction norm (cf. Belloni 


et al. (2014)). Note that in the best case, when the design is well-behaved and A 2 is a 

i(c, <5 q, Ai, A 2 ) > 


constant, we have that 


1 




(3.14) 


VPoiio 

where n > 0 is a constant. Remarks given below provide further discussion. 

The following theorem provides the deviation bounds for the lava prediction error. 

Theorem 3.3 (Deviation Bounds for Lava in Regression). We have that with probability 
1 — a — e 


-||X 0 lava -W?o||! 

n 


< — 1| X{5 - So)r 2 \\ K A2 || + — ||D rt dse(A 2 )|| 2 


n 


n 


< 


inf 

($0 1 ^ 0 )' SM 2p :<5 q +ho =@o 


{(r 1 (<5 0 )VB 2 


K 


A 2 


+ B 3 + B 4 


where |j K A2 || < 1 and 


Bi(S 0 ) = 


2 3 A? 


< 


2b °lc 2 Vl 2 log(2p/a) 

i 2 (c, fl 0 , Ai,A 2 ) “ m 2 (c, 6 0 , Ai,A 2 ) 

2 5 „.,i/2 


B 2 W 0 ) = - II Ki X/do\\l = 2 5 X 2 p' 0 S(S + X 2 I)- 1 f5 0 , 


2^2 


Bs = 


2 2 a; 


n 


\/ tr ( P a 2 ) + V^v4 p i 2 ll 


B*m = -II Ka 2 XPoWl = 2 2 / 3 'R A 2 / 3 o < 2 3 B 2 ((3 0 )\\ K Aa ||. 

Remark 3.3. As noted before, the “sparse+dense” framework does not require the sep¬ 
arate identification of (/Jo, So). Consequently, the prediction upper bound is the inhmum 
over all the pairs (/3q,S 0 ) such that (5q + 5o = Go- The upper bound thus optimizes 
over the best “split” of 60 into sparse and dense parts, <5o and /3q. The bound has four 
components. B\ is a qualitatively sharp bound on the performance of the lasso for - 
transformed data. It involves two important factors: V \ 2 and the design impact factor 
i(c, <5o> Ai, A 2 ). The term R 3 is the size of the impact of the noise on the ridge part of 
the estimator, and it has a qualitatively sharp form as in Hsu et al. (2014). The term 
I ?4 describes the size of the bias for the ridge part of the estimator and appears to be 
qualitatively sharp as in Hsu et al. (2014). We refer the reader to Hsu et al. (2014) for 
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the in-depth analysis of noise term £>3 and bias term B 4 . The term £> 2 1| K A2 || appearing 
in the bound is also related to the size of the bias resulting from ridge regularization. In 
examples like the Gaussian sequence model, we have 


64(A)) < 6 2 (A))|| K A2 


< 


K a 2 


K 


6 4 (/3 o). (3.15) 

< 1 , which occurs if A 2 stochas- 


a 2 


This result holds more generally whenever 
tically dominates the eigenvalues of S (see our supplementary material Chernozhukov 
et al. (2015) for detailed derivations). 


Remark 3.4 (Comments on Performance in Terms of Rates). It is worth discussing 
heuristically two key features arising from Theorem |3.3| 

1) In dense models where ridge would work well, lava will work similarly to ridge. 
Consider any model where there is no sparse component (so 6 q = A))> where the ridge- 


type rate B* = i? 4 (/3q) + £>3 is optimal (e.g. Hsu et al. (2014)), and where (3.15) holds. 


In this case, we have 64 (do) = 0 since do = 0, and the lava performance bound reduces 
to 

Wo) II Ka 2 II + 63 + Wo) < Wo) + 63 = B *. 

2) Lava works similarly to lasso in sparse models that have no dense components 
whenever lasso works well in those models. For this to hold, we need to set A 2 > n - 
Consider any model where 0q = do and with design such that the restricted eigenvalues 


k of Bickel et al. (2009) are bounded away from zero. In this case, the standard lasso 
rate 

11 do 110 log( 2 p/ a) 


B* = 


nK- 


of Bickel et al. (2009) is optimal. For the analysis of lava in this setting, we have that 
60(A)) = -64(A)) = 0 . Moreover, we can show that R 3 < n -1 and that the design impact 


factor obeys (3.14) in this case. Thus, 


61 (d 0 ) 


11 dp 110 log( 2 p/q) _ 

rv 9 ^ 


and (61 (d 0 ) V 6 2 (Aj))|| K A2 || + 63 + 64(A)) < 6 * follows due to || K Aa || < 1 . 

Note that we see lava performing similarly to lasso in sparse models and performing 
similarly to ridge in dense models in the simulation evidence provided in the next section. 
This simulation evidence is consistent with the observations made above. ■ 

Remark 3.5 (On the design impact factor). The definition of the design impact factor 


is motivated by the generalizations of the restricted eigenvalues of Bickel et al. (2009) 


proposed in Belloni et al. (2014) to improve performance bounds for lasso in badly 


behaved designs. The concepts above are strictly more general than the usual restricted 
eigenvalues formulated for the transformed data. Let J(do) = {j < p : doj 7 ^ 0}. For any 


vector AgP, 
J(do)}. Define 


respectively write A J(5o) = {A j : j € J(d 0 )} and Ajc (5o) = {Aj : j g 


Ac, So) = {v€ \ {0} : ||A JC (d 0 )||i < (c+l)/(c- 1)||A j(5o) || 1 }. 
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The restricted eigenvalue n 2 (c, do, A 2 ) is given by 
k 2 (c, <5o, A 2 ) = 


inf 

AeA(c,<5o) 


|AA|| 2 /n 


|A 


J( 6 o) 


inf 

AeA(c,(5 0 ) 


A' K A2 X/n 


IA 


0 ) 


Note that A(c, 60 ) C TZ(c, do, Ai, A 2 ) and that 

lAAII, 


i(c,5 0 ,Ai, A 2 ) > inf 

AeA{c,6 0 ) 


n 


IA 


> 


J(So)\\l 


v/P 


= ac(c, (5 0 , A 2 ). 


00 


Now note that X' K \ 2 X/n = A 25 '( 5 +A 2 /p) 1 ■ When A 2 is relatively large, X' K x 2 X/n-- 
A 2 S’(S’ + A 2 / p ) 1 is approximately equal to 5. Hence, k 2 (c, do, A 2 ) behaves like the usual 
restricted eigenvalue constant as in Bickel etal. (2009), and we have a good bound on the 


design impact factor t(c, 5o ; Ai, A 2 ) as in (3.14). To understand how k 2 (c, <5o, A 2 ) depends 
on A 2 more generally, consider the special case of an orthonormal design. In this case, 
S = I p and X'K \ 2 X/n = kl p with k = A 2 /(l + A 2 ). Then k 2 (c, <5o>A 2 ) = k, and the 
design impact factor becomes Vk/ -y/pojjo. 

Thus, the design impact factor scales like 1 /-\/||Ao||o when restricted eigenvalues are 
well-behaved, e.g. bounded away from zero. This behavior corresponds to the best 
possible case. Note that design impact factors can be well-behaved even if restricted 
eigenvalues are not. For example, suppose we have two regressors that are identical. 


Then k(c, < 5 o>A 2 ) = 0, but i(c, do, Ai, A 2 ) > 0 in this case; see Belloni et al. (2014)) 


4. Simulation Study 


The lava and post-lava algorithm can be summarized as follows. 

1. Fix Ai, A 2 , and define P Aa = X{X'X + nA 2 I p ) _1 A' / , K Az = I n — P Aa . 

2. For Y = KpY, and X = K^ 2 A, solve for 

i = a r gmm{l| | y-« | |i + A 1 p| | 1 }. 

3. Define (3(d) = (X'X + nA 2 / p ) _1 A / (y — Xd). The lava estimator is 

$lava = P(&) + 

4. For W = Y — X(3(d), solve for 

^=argmm<P||fH- X d\\ 2 2 , dj = 0 if d 3 = o|. 

5. The post-lava estimator is 

^post-lava — J> (d) -p d. 

We present a Monte-Carlo analysis based on a Gaussian linear regression model: Y = 
X9 + U, U | A ~ A(0, I n ). The parameter 9 is a p-vector defined as 

0 = (3,O,...,O) , + g(O,.l,...,.l) / , 

where q > 0 denotes the “size of small coefficients”. When q is zero or small, 9 can 
be well-approximated by the sparse vector (3,0, ...,0). When q is relatively large, 9 


















22 


VICTOR CHERNOZHUKOV, CHRISTIAN HANSEN, AND YUAN LIAO 


cannot be approximated well by a sparse vector. We set n = 100 and p = 2 n, and 
compare the performance of X6 formed from one of five methods: lasso, ridge, elastic 
net, lava, and post-lavaj^] The rows of X are generated independently from a mean zero 
multivariate normal with covariance matrix E. We present results under an independent 
design, E = /, and a factor covariance structure with E = LL' +1 where the rows of L 
are independently generated from N(0, Is). In the latter case, the columns of X depend 
on three common factors. We focus on a fixed design study, so the design X is generated 
once and fixed throughout the replications. 


To measure performance, we consider the risk measure R(0,0) = E[-||X0 — A’0|||] 
where the expectation E is conditioned on X. For each estimation procedure, we report 
the simulation estimate of this risk measure formed by averaging over B = 100 simulation 
replications. Figures [ 4 ] and 5 plot the simulation estimate of R(0, 9) for each estimation 
method as a function of q, the size of the “small coefficients”. In Figure [4] all the 
tuning parameters are chosen via minimizing the SURE as defined in Theorem 3.2; and 
the tuning parameters are chosen by 5-fold cross-validation in Figure 5. The SURE 
formula depends on the error variance er 2 , which must be estimated. A conservative 
preliminary estimator for <r 2 can be obtained from an iterative method based on the 
regular lasso estimator; see, e.g., Belloni and Chernozhukov (2013). On the other hand, 
/.■-fold cross-validation does not require a preliminary variance estimator. 


The comparisons are similar in both figures with lava and post-lava dominating the 
other procedures. It is particularly interesting to compare the performance of lava to 
lasso and ridge. The lava and post-lava estimators perform about as well as lasso when 
the signal is sparse and perform significantly better than lasso when the signal is non- 
sparse. The lava and post-lava estimators perform about as well as ridge when the 
signal is dense and perform much better than ridge when the signal is sparse. When 
the tuning parameters are selected via cross-validation, the post-lava performs slightly 
better than the lava when the model is sparse. The gain is somewhat more apparent 
in the independent design. Additional simulations are presented in our supplementary 
material 


(Chernozhukov et al. (2015 


5. Discussion 

We propose a new method, called “lava”, which is designed specifically to achieve 
good prediction and estimation performance in “sparse+dense” models. In such models, 
the high-dimensional parameter is represented as the sum of a sparse vector with a 
few large non-zero entries and a dense vector with many small entries. This structure 
renders traditional sparse or dense estimation methods, such as lasso or ridge, sub- 
optimal for prediction and other estimation purposes. The proposed approach thus 
complements other approaches to structured sparsity problems such as those considered 


in fused sparsity estimation 1 

Tibshirani et al. ( 

2005) and 

Chen and Dalalyan 

(2012)) 

and structured matrix decomposition problems 

Candes et al. 

(2011), 

Chandrasekaran 

et al. 

(2011) 

Fan et al. ( 

2013 

1 , and 

Klopp et al. 

(2014)). 


2 Results with p = n/2, where OLS is also included, are available in supplementary material. The re¬ 
sults are qualitatively similar to those given here, with lava and post-lava dominating all other procedures. 
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Approximated risk, p=2n, 2=1 




size of small coefficients 


Approximated risk, p=2n,2=LL’+l 



size of small coefficients 



size of small coefficients 


FIGURE 4. Simulation risk comparison with tuning done by minimizing SURE. In 
this figure, we report simulation estimates of risk functions of lava, post-lava, ridge, 
lasso, and elastic net in a Gaussian regression model with “sparse+dense” signal struc¬ 
ture over the regression coefficients. We select tuning parameters by minimizing SURE. 
The size of “small coefficients” is shown on the horizontal axis. The size of these co¬ 
efficients directly corresponds to the size of the “dense part” of the signal, with zero 
corresponding to the exactly sparse case. Relative risk plots the ratio of the risk of 
each estimator to the lava risk, R(0, 9 e )/R(9, #i ava ). 


There are a number of interesting research directions that remain to be considered. An 
immediate extension of the present results would be to consider semi-pivotal estimators 
akin to the root-lasso/scaled-lasso of Belloni et al. (2011) and Sun and Zhang (2012). 
For instance, we can define 


0 


root-lava • 


P + 6, 
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Approximated risk, p=2n, 2=1 



Risk relative to lava, p=2n, 2=1 



Approximated risk, p=2n, 2=LL’+I 



Risk relative to lava, p=2n, 2=LL’+I 



FIGURE 5. Simulation risk comparison with tuning done by 5-fold cross-validation. 
In this figure, we report simulation estimates of risk functions of lava, post-lava, ridge, 
lasso, and elastic net in a Gaussian regression model with “sparse+dense” signal struc¬ 
ture over the regression coefficients. We select tuning parameters by 5-fold cross- 
validation. The size of “small coefficients” is shown on the horizontal axis. The size 
of these coefficients directly corresponds to the size of the “dense part” of the signal, 
with zero corresponding to the exactly sparse case. Relative risk plots the ratio of the 
risk of each estimator to the lava risk, R($, 9 e )/R{9, 0i ava ). 


OM) : = 


arg mm 

PM 


2na 2 


\Y-X(J3 + S)\\% + 


(1 — a) a 


+ -^2 ||/ 3||2 + ^1 


Thanks to the characterization of Theorem 3.1, the method can be implemented by 


applying root-lasso on appropriately transformed data. The present work could also be 
extended to accommodate non-Gaussian settings and settings with random designs, and 
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it could also be extended beyond the mean regression problem to more general M- and 
Z- estimation problems, e.g., along the lines of Negahban et al. (2009). 


Appendix A. Proofs for Section 2 


2.1 


Fixing S, the solution for /3 is given by j3(5) = (z — S)/( 1 + 


A.l. Proof of Lemma 

A 2 ). Substituting back to the original problem, we obtain 

di(z) = argminfz — (3(5) — A ] 2 + A 2 |/3(<5 )| 2 + Ai|5| 
= argmin k(z — <5 ) 2 + Ai|<5|. 


Hence d\(z) = (|z| — Ai/(2fe))_|_sign(z), and d 2 (z) = P(di(z)) = (z\ — di(z))(l — k). 
Consequently, diava(^) = d\(z) + d 2 (z) = (1 — k)z + kd\(z). m 

A.2. A Useful Lemma. The proofs rely on the following lemma. 

Lemma A.l. Consider the general piecewise linear function: 

hz + d, z > w 
F(z) = ez + m, \z\ < w ■ 
fz + g , z < w 


Suppose Z ~ N(6,a 2 ). Then 


E [F(Z] 2 ] = [a 2 (h 2 w + h 2 0 + 2dh) — a 2 (e 2 w + e 2 0 + 2me)]^)g i(T (w) 

+[a 2 (-e 2 w + e 2 0 + 2 me) - a 2 (-f 2 w + f 2 0 + 2gf)](j)e,a{-w) 

+((h6 + df + h 2 a 2 )Pg :a (Z >w) + ((. f6 + gf + f 2 a 2 )Pg, a (Z < -w) 
+((e6 + m) 2 + e 2 a 2 )Pg )CT (\Z\ < w). 

Proof. We first consider an expectation of the following form: for any —00 < z\ < z 2 < 
00 , and a, b G M, by integration by part, 

E [6 — Z)(aZ + b)l{zi < Z < z 2 j = a 2 f — 5 —(az + b)<fg a (z)dz 

Jz! a 

f Z2 

= a‘ 2 (az + b)4>g tCT (z)\ Z z 2 i -a 2 a / fg^(z)dz 

J Z\ 

= a 2 [(az 2 + b)4>g :(7 (z 2 ) - (azi + b)<j>o^(z{)\ - a 2 aP e ^(z 1 < Z < z 2 ). (A.l) 

This result will be useful in the following calculations. Setting a = — 1, b = 6 and 
a = 0, b = —2(6 + c) respectively yields 

E(0 — Z) 2 l{z\ < Z < z 2 j = g 2 [(6 - z 2 )c/)g^(z 2 ) - (6 - z 1 )f>g tCT (z 1 )\ + a 2 Pg tCX (zi < Z < z 2 ). 

2E(Z -6)(6 + c)l{zi <Z<z 2 } = a 2 [-2(0 + c)cfg^(z 2 ) + 2(6 + c)<t> e , a (z i)]- 
Therefore, for any constant c, 

E (Z + c) 2 l{zi < Z < z 2 } = E (6 — Z) 2 l-[zi < Z < z 2 } + (6 + c) 2 Pg tCr (zi < Z < z 2 ) 

+2E(Z - 6)(6 + c)l{z! < Z < z 2 } 

= a 2 (z 1 + 6 + 2 c) 00 )O .(zi) - g 2 (z 2 + 6 + 2c)(fg^(z 2 ) 
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+ ((@ + c) 2 + C7^)Pg a (zi < Z < Z 2 ). (A.2) 


If none of h, e, f are zero, by setting z 1 = in, Z 2 = 00 , c = d/h] z\ = — 00 , Z 2 = — in, c = 
g/f and z\ = —in, Z 2 = in, c = m/e respectively, we have 

E (hZ + d) 2 l{Z > w} = a 2 (h 2 w + K 2 0 + 2dh)cf>g^(w) 

-\-{{9h + d)" + cr^/i 2 )P q^u(Z > re), 

E(/Z + <?) 2 1{Z < -in} = —<r 2 (—in/ 2 + 0/ 2 + 2g})<j> e ^{-w) (A.3) 

+ (( 0 / + ff) 2 + Af 2 )P e A Z < -in), 

E(eZ + m) 2 l{|Z| < in} = cr 2 (—me 2 + 0e 2 + 2me)4>e,a(— in) 

— a 2 (we 2 + 6e 2 + 2me)(j)eA w ) 

+{(9e + m) 2 + u 2 e 2 )P 0i(T (|Z| < in). (A.4) 


If any of h,e,f is zero, for instance, suppose h = 0, then E(h Z + d) 2 l{Z > in} = 
^ 2 Pe»,o-(^ > w), which can also be written as the first equality of (A.3). Similarly, when 
either e = 0 or / = 0, ( A.3 ) still holds. 

Therefore, summing up the three terms of (A.3) yields the desired result. ■ 


A.3. Proof of Theorem 


2.1 


Recall that $i ava = (1 — k)Z + kdi asBO (Z) is a weighted 
average of Z and the soft-thresholded estimator with shrinkage parameters X\/{2k) and 


k = X 2 /(1 + A 2 ). Since di asso (Z) is a soft-thresholding estimator, results from Donoho 
and Johnstone (1995) give that 

E [(Z - d)d ]aBBO (Z)\ = g 2 PqA\ z \ > Ai/(2 k)). 

Therefore, for in = Ai/(2 k), 

2 E [(Z - 0)d lava (Z)] = 2(1 - k)o 2 + 2ka 2 PeA\ z \ > A- 

Next we verify that 

E (Z - di a va(^)) 2 = -k 2 (w + d)4> e ^(w)a 2 + k 2 (9 - w)cj)g :<r (-w)a 2 


+(A 2 /4)P eiff (|Z| > in) + k 2 (9 2 + a 2 )PeA\Z\ < w). 


(A-5) 

(A.6) 


By definition, 


4wa(z) - z = < 


—Ai/2, z>X 1 /{2k), 

—kz , —X\/(2k)<z<Xi/(2k), 

Ai/2, z < —X\/(2k). 


(A.7) 


Let F(z) = diava(^) — 2. The claim then follows from applying Lemma A.l by setting 
h = f = m = 0,d = —Ai/2, e = —k, g = Ai/2, and in = X\/(2k). 

Hence 

E (Z - duA z )) 2 = -k 2 (w + 9)(j)g,cr(w)a 2 + k 2 {9 - w)(j)g,a(-1 J ’)A 

+(A 2 /4)P e>(7 (|Z| > in) + k 2 (9 2 + A)PeA\Z\ < w). 

The risk of lasso is obtained from setting A 2 = 00 and Ai = A/ in the lava risk. The 
risk of ridge is obtained from setting Ai = 00 and A 2 = A r in the lava risk. 
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As for the risk of post-lava, note that 

2 — 


^post-lava (^) 9 


\z\ > Ai/(2 k) 
(1 — k)z — 9, \z\ < Ai/(2fc). 


Hence applying Lemma A.l to F(z) = (ipost-iava^) —9, i.e. by setting h = f = 1, e = 1 — k 
and d = m = g = — 6 , we obtain: 

R(9, 9 post _ lava ) = a 2 [-k 2 w + 2 kw - k 2 9}p 8)a (w) + a 2 [-k 2 w + 2 kw + k 2 9](j)e t( j{-w ) 

+a 2 PeA\Z\ > w) + (A: 2 0 2 + (1 - k) 2 a 2 )P e A\ z \ < ™)- 

Finally, the elastic net shrinkage is given by 

1 


denet, (z) — 


1 + A 2 


(M - Ai/2)+sgn(z). 


The risk of elastic net then follows from Lemma |A.l by setting F(z) = d en et(z ) — 9, 
w = Ai/2,/r = / = 1/(1 + A 2 ),e = 0, d = —Ai/(2(1 + A 2 )) - 9 and g = Ai/(2(1 + A 2 )) — 9. 


A.4. Proof of Lemma 


2.2 


For 0i ava = (0iavaj)Li, we have 


A 2 


1 + A 2 ’ 


E||@lava - 0\\l = Y, R{0j,Kv*j)- 
3 = 1 

We now bound R(9j,9\ ava j ) uniformly over j = We have that 

^lavaj — 9j = (1 — k)Zj + k5j — 9j = [(1 — k)(Zj — 5j) — Pj\ + k(5j — 5j), k = 

where 5j = diassoC-^j) with penalty level Ai. Hence 

E (9j - 9jf = E[(l - k)(Zj - 5j) - fa} 2 + k 2 E{6j - 5jf 

'--V-----" "--V-■-' 

7 II 

+ 2kE{(6 J - <5,)[(1 - k)(Zj - 6j) - Pj}} 

'-V-^ 

III 

Bounding I. Note that Zj — 5j ~ N(Pj, u 2 ). The first term Gj = (1 — k)(Zj — 5j) — Pj 
is thus the bias of a ridge estimator, with EG 2 = (1 — k) 2 a 2 + k 2 p 2 . 

Bounding II. Note that Ai = 2 ct < J>~ 1 (1 — c/(2p)). By Mill’s ratio inequality, as long as 
> log P> 2^/2log p > Ai/cr > 2Vlogp. In addition, 

E(5j - 5j ) 2 = E(<5j - 9j) 2 + P 2 + 2E(Sj - 9j)Pj. 


Since Zj N(9j,a 2 ), by Theorem |2.l| with A; = X\/k > 2a^f\ogp (since k < 1), 
E (5j - 9j) 2 = (A//2 + 9j)p ej A^i/2W + (% - V2)<V CT (-A,/2)u 2 


^(i) 


+{\i/A + a 2 )P e ., (T {\Zj\ > Ai/2) + 9 2 E d] A\Z j \ < A,/2) 

■ 2 / j . & /0_fl .A2/o^2 


(A 2 /4 + a 2 


V2n(\i/2-9j) 


= —(Ai/2—0j) 2 /2cr 2 
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<( 2 ) 

<(3) 

<(4) 


+(A?/4 + a 2 ) 


a 


2 X (A z /4 + a ) 


\/2vr(Az/2 + Q j 
2' 4(7 


(Az/2+Sj)^ 

-e ^ 3— +0?P^(l Z il< V2) 


_^ e -A?/(32^) +0 2 

v / 2vrA/ 


f^ e -A?/(32^) + 0 2 
\Z2na J 

4cj2 =e -A?/(64^) + ^2 < 


4a 2 


y/2 7T 


J v^7Tp( 1 / 4 )'" 


+ i 


where (1) follows from the Mill’s ratio inequality: e t2 ^ 2 dt < x i e i 2 /2 £ or x > o. 

Also note that A//2 ± 6j > 0 since ||#||oo < M and u^/logp > 2 M. Hence we can apply 
the Mill’s ratio inequality respectively on P e ,a(Zj > A//2) and P g , a (Zj < —Xi/2). 
In addition, the first two terms on the right hand side are negative. (2) is due to 
Xi/2 ± Qj > A;/4 since \6j\ < M and o\/\ogp > 2M. (3) follows since 4 a 2 < X 2 when 
p > e. Finally, for any a > 0, and any x > 1 + a -1 , ax 2 > logs. Set a = 64 _1 ; when 
\f\ogp > 33, log(A;/(j) < A 2 /(64cr 2 ). Hence ^-e~ x i^ 32rj2 ^ < e _A ? /464 °' 2 ), which gives (4). 
Therefore, 


E E <% 

3 = 1 


W < 


4 pa 2 


y/ZirpO-/ 4 )' 


+ m 


On the other hand, applying (A.l) and by the same arguments as above, uniformly 
for j = 1 ,...,p, 


|E<5j| = \a 2 ^ dj AXi/2) ~ fojA- Ai/2)] + (0j ~ A,/2)P g.^Zj > Xi/2) 
+{0 j + Xi/2)Y>e J AZ j < — Ai/2)| 

< l°L p -\?/{32a 2 ) < 8(7 1 

- V2/r - V2^P 1/8 ' 


Moreover, ||#|| 2 + \\/3\\ 2 - 

fm-5,) 2 

3 = 1 


2 > I 1 - i^A./ = ll^lli- Hence 

< E - %) 2 + ll^lli - 2 E + 2 E E ^; 


< 


i=i 


ll^lli + 


4p<r 2 

V / 27rp 1 / 16 


ll/3||oo 


3 = 1 

lOup 


i=i 


Donoho and Johnstone 


Bounding III. By 
Ai/(2 k)). Hence (note that P(Zj — Qj) = 0) 


(1995), E [(Qj - Zj)Sj] = -a 2 Y 9j A\Zj\ > 


E[(% - 5j){Zj 




< 

< 


E[Sj(Zj-Qj)]+E[Sj-Sj)Pj\ 
a 2 P dj A\Zj\ > X 1 /(2k)) + /5 j E6j 


2(7 


2 4(7 A 2 /(32cr 2 ) 


\/27tA i 


4o- 2 


v/ 27T v/logpp 1 / 8 


+ ft-E^- 
+ /JjrEJj — Sj/3j, 


fijpj 
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implying 


E - 5jW - k)(Zj - 5,) - fr]} 

3 = 1 

(i - *o E - w -w- E + E 

i=i j=i i=i 

4a 2 p 1 p p 


< (1-fc)- 


+ (1 /5j,E(5y + k ^2 bjfij 


'V^rVlogpP 1 / 8 ' ^ 

4f7 2 „ I P _ P 

(1 * - ‘E« + t E^A 

- (1 - L A-7-h ^ xl'Tf-!x ~ E 


< M 


—1=—ttf + &E (since M 2 logp > 16a 2 , k < 1, ||/3||oo < Af). 

\/27r p ' z —f 

7=1 


A/27T v/logpp 1 / 8 
p 


V^pl/ 8 ' 

7=1 


Summarizing, we obtain (note that ||0||| < ||/3|| 2 + 3sM 2 ) 

P Apa 2 


E E ^-^) 2 ^ P(1 - fc ) V + fc2 |^ll2 + fc 2 PllI + fc 2 ^ /1fi + 

3— 1 f v----1---- 


7/ 


+ 2 kM^L^ + 2fc 2 E S 3^3 


V^ip 1 / 8 


7=1 


III 
\2„2 , 7.21 


= p(l - k) 2 a 2 + k 2 \\/3 + (5||| + k 2 


4 pU 2 ,-, nt 2 io7 \ Afap 


-v/^vrp 1 / 16 


+ (10fc 2 + 18 k) 


< p{l - k) 2 a 2 + k 2 (\\/3\\2 + 3sM 2 ) + 


Apa 2 28 Map 


\/2ttp 1 / 8 


+ 


-v/^rp 1 / 16 y/^TTp 1 / 8 


Finally, due to E ||Z — 6\\ 2 = pa 2 , we have 


El^lava ~ Of 

e\\z- e \\ 2 


k 2 


k 2 


28 M 


— (1 — k) H- q 11/?112 4- ^3sM H— .— - + ,— . 

pa 2 pa 2 -v/^vrp 1 ^ 16 a\Z2np 1 / 8 

, 3sM 2 4 28M 

= l-fc +-;5- + -=——+ 


pa 2 -v/^rp 1 ^ 16 G\f 7 h\p 1 l 8 

9 fc 2 ||/3|| 2 

where the last equality is due to (1 — /c) J H— 2 2 = 1 — k for k 


(T p 


\\‘i+° 2 p' 


A.5. Proof of Theorem 2.3, The first result follows from equation (A.5) in the proof 


of Theorem 2.1; the second result follows directly from (2.7). 
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Appendix B. Proofs for Section 3 


B.l. Proof of Theorem 3.1 Let Q Aa = [X'X + nX 2 I p \■ Then for any 8 6 


X{m + S} = X{Q~^X'(Y-X8) + 8} 

= P\ 2 Y + (I p - P Aa )X8 = P A2 Y + K A2 X5. 

The second claim of the theorem immediately follows from this. 

Further, to show the first claim, we can write for any 8 € M p , 

||y _ xm - X6\\ 2 2 = || (I n - P A2 )(T - X6)\\l = || K Aa (T - XS)\\l 
n\ 2 \\m\\ 2 2 = n\ 2 \\Q^X'(Y-X8)\\l 
The sum of these terms is equal to 


(Y - XS)'[Kl a +n\ 2 X Q- 1 Q- 1 X']{Y - XS) = || < 2 (F - XS)\\l 

where the equality follows from the observation that, since K| 2 = I n — 2X Q^ 1 X' + 
X Q' 1 X'X Q^ 1 X' and [X'X + n\ 2 I p ] Q' 1 = I p , we have 

K| 2 +nX 2 X Q" 1 Q- 1 X' = I n -2X X' + X Q" 1 [X'X + nX 2 I p } Q^ 1 X 1 

= I n — X Q^ 1 X' = I — P\ 2 = K\ 2 . 


Therefore, after multiplying by n, the profiled objective function in (3.3) can be expressed 
as: 

a/2/ v vx\\\2 


K^(y-^)||^ + nA 1 ||<5|| 


i- 


This establishes the first claim. 


B.2. Proof of Theorem |3.2 Consider the following lasso problem: 

h x {y) := argmin j^||y - K^ 2 X8\\% + A||<5||i j . 


Let g\(y,X) := Xh\(y), where X := K^X. By Lemmas 1, 3 and 6 of 
Taylor (2012), y H > g\ 1 (y,X) is continuous and almost differentiable, and 

dg\ Av,X) 


Tibshirani and 


dy 


= XAX'jX.j)-X'r 


J'jJ' 
1/2 


Then by Theorem |3.l[ Xd lSLVa (y, X) = Xh Xl (K^y) = g Xl (K^y, X). Therefore, 


V„-(K X , Xdu.Ay, X)) = tr (*'£ — 


= trK 2 XAX‘jXj)-X'jK Jf) 


It follows from ( |3. 10 ) that 


df(0) = tr(P A2 ) + Etr(K^ XjiX'jXjrX'j K^ 2 ) 

= tr(P Aa ) + Etr (Xj(X'jX J )-X' J (I - P^ 

= tr(P Aa ) + Etr (XjiX'jXjyX'j) - Etr (Xj(X'jX j)~X'j P Az ) 
= Erank(Aj) + Etr(Kj P Aa ). 
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B.3. Proof of Lemma 


3.1 


Note that A'6*i ava + P ; U = P -,Y + K; X$i ava = P Y + 


K j X(3 + Kj X5 and X 0 pos t-iava = P j Y + K j X/3. Hence it suffices to show that K j X5 = 
0. In fact, let 5 1 be the vector of zero components of 5, then X 6 = X j5 1 . So K ? X5 = 


K j Xj 6 j = 0 since K j Xj = 0. 


B.4. Proof of Theorem 3.3 Step 1. By (3.6), 


-WXO^-XOoWl < -||K A2 A(?-5o)||2 + -||Dndge(A 2 )||I 

n n n 

,1/2 


< -II X(S-S 0 m\\ Ka 2 I! + — 1| D ridge (A 2 ) H 2 , 


n 


since || K Aa || < 1 as shown below. Step 2 provides the bound (Bi(5o) V H 2 (A)))|| K A2 
for the first term, and Step 3 provides the bound B 3 + i ?4 (/3o) on the second term. 
Furthermore, since X' K Aa X = nA 2 S’(S' + A 2 /)^ 1 , we have 


B 2 (P 0 ) = ^ll^olll = y/ 0 X' K A2 xp 0 = 8 \ 2 / 3 qS(S + A 2 I) _1 /3 0 . 


Also, to show that || K Az H 2 < 1, we let P Aa = U\D\U( be the eigen-decomposition of 
P A2 , then || K A2 || = \\U\(I — D\)U[\\ = ||J — Z?i||. Note that all the nonzero eigenvalues of 
D\ are the same as those of (X'X + reA 2 I)~ 1 / 2 A' , X(X / X + nA 2 /) -1 / 2 , and are {dj/(dj + 
n A 2 ), j < min{n,p}}, where dj is the jth largest eigenvalue of X'X. Thus \\I — D i|| = 
max {maxj 12 X 2 /(dj + nA 2 ), 1} < 1. 

Combining these bounds yields the result. 

Step 2. Here we claim that on the event || ^-X’ , C/|| 00 < c 1 Ai, which holds with 
probability 1 — a, we have 


1 ~ ^ 4A 2 

-ms - <50)111 < 2 ( A \ \ \ v 

n t (c,do, Ai,A 2 J n 


4 2 ||AA-" 2 


011 2 


= B 1 (6 o )VB 2 (0 o ). 


By (3.4), for any 5 G 


-\\Y - X 6 \\% + A 1 ||?|| 1 < -\\Y - XSo\\l + Ai||5 0 ||i. 
n n 


Note that Y = XSq+U +A/?o, which implies the following basic inequality: for A = <5—do, 
on the event |||;A' / t/|| 0O < c _1 Ai, 


— ||ACA.|| 2 < Ai ||5o||i — ||<5o + A||i + 

n \ 


-AX 'U 
n 


+ 2 


-(XA)'(XPo) 

n 


— Ai[ 11 Aq||i — ||do + A||i + c 1 A||A||i I + 2 


1 ~ A 


1 

—AA 


— —X[3q 

y/n 

2 

y/n 


2 
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or, equivalently, 


-ILYA 

n 


1 - 


7n X P 0 


ilA 


< Ai 


i — II + LA 111 + c 1 A 11 A 111 ). 


If 


-Lja 


< 4 




— ||AA||n < 2Ai 


, then we are done. Otherwise we have that 

i — IAo + A||i+c 1 11A || i 


Thus A E 1Z(c, So, Ai, A 2 ) and hence by the definition of the design-impact factor 

1 


1 ~ ~r = ||-X"A||2 

-||AA|||<2A 1 - ' 

n i[c, o 0 , Ai, A 2 ) 




Combining the two cases yields the claim. 

Step 3. Here we bound 77 1| D r i t j ge (A 2 ) |||- We have 


|||D ridge (A 2 )||l < ±||K A2 A/3o||| + ^|| Pa 2 u 111. 


By Hsu et al. (2014)’s exponential inequality for deviation of quadratic form of sub- 
Gaussian vectors the following bound applies with probability 1 — e: 

4 

-II Pa 2 U\ 

n 


< 



n 

< 

CN 

b 

^ 1 


n 

< 



n 


\A r ( p a 2 ) + ^2^11 P| 2 II \/log(l/e) 


= Bo, 


where the second inequality holds by Von Neumann’s theorem (Horn and Johnson 
( 2012 )), and the last inequality is elementary. 

Furthermore, note that K Aa X = A 2 Al( 5 + A 2 /) _1 . Hence 


H 4 (/3 0 ) = -|| K A2 XPoWI = 4A|/3 '(S + A 2 I)~ 1 S(S + A,/)" 1 #) = 4#,^ a A). 
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